Semantic Regularities in Document Representations

Authors

  • Fei Sun
  • Jiafeng Guo
  • Yanyan Lan
  • Jun Xu
  • Xueqi Cheng
Abstract

Recent work has shown that distributed word representations are good at capturing linguistic regularities in language, allowing vector-oriented reasoning based on simple linear algebra between words. Since many different methods have been proposed for learning document representations, it is natural to ask whether these learned representations also exhibit linear structure that permits similar reasoning at the document level. To answer this question, we design a new document analogy task for testing the semantic regularities in document representations, and conduct empirical evaluations over several state-of-the-art document representation models. The results reveal that neural embedding based document representations work better on this analogy task than conventional methods, and we provide some preliminary explanations for these observations.
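To make the kind of vector-oriented reasoning described above concrete, here is a minimal sketch of an analogy query ("a is to b as c is to ?") answered by vector arithmetic and cosine similarity. It assumes pretrained embeddings are available as a plain {word: vector} mapping; the names and data are illustrative, not the paper's code.

```python
import numpy as np

def analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' via the offset v_b - v_a + v_c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in (a, b, c):          # exclude the query words themselves
            continue
        sim = float(vec @ target) / np.linalg.norm(vec)  # cosine similarity
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# e.g. analogy("man", "king", "woman", embeddings) should ideally return
# "queen"; the paper's document analogy task applies the same test to
# learned document vectors instead of word vectors.
```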


Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing the semantic concepts of text has motivated researchers to use...
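For reference, a short sketch of the TF-IDF document vectors mentioned above, using scikit-learn's TfidfVectorizer on a toy corpus (the corpus and this snippet are illustrative assumptions, not from the paper):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse (n_docs, n_terms) matrix

# Each row is a document vector: weights grow with within-document term
# frequency and shrink for terms common across the whole corpus, but the
# dimensions remain surface words rather than semantic concepts.
print(X.shape)
print(vectorizer.get_feature_names_out())
```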


Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec

Distributed dense word vectors have been shown to be effective at capturing token-level semantic and syntactic regularities in language, while topic models can form interpretable representations over documents. In this work, we describe lda2vec, a model that learns dense word vectors jointly with Dirichlet-distributed latent document-level mixtures of topic vectors. In contrast to continuous den...
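A highly simplified sketch of the lda2vec composition described above: a document vector is a mixture of topic vectors (the mixture weights being the Dirichlet-encouraged document-topic proportions), and it is added to a pivot word vector to form the context vector used for skip-gram-style prediction. Shapes, names, and random values here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, dim = 4, 8

topic_vectors = rng.normal(size=(n_topics, dim))   # learned topic embeddings
doc_topic_logits = rng.normal(size=n_topics)       # per-document topic weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

doc_proportions = softmax(doc_topic_logits)        # mixture over topics
doc_vector = doc_proportions @ topic_vectors       # document vector
word_vector = rng.normal(size=dim)                 # pivot word embedding

# The context vector that scores surrounding words combines both views:
context_vector = word_vector + doc_vector
```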


Learning semantic structures from in-domain documents

Semantic analysis is a core area of natural language understanding that has typically focused on predicting domain-independent representations. However, such representations are unable to fully realize the rich diversity of technical content prevalent in a variety of specialized domains. Taking the standard supervised approach to domain-specific semantic analysis requires expensive annotation ef...


Deep Learning of Semantic Word Representations to Implement a Content-Based Recommender for the RecSys Challenge'14

In this paper, we discuss a recommender system that exploits the semantic regularities captured by a Recurrent Neural Network (RNN) in text documents. Many information retrieval systems treat words as binary vectors under the classic bag-of-words model; however, there is no notion of semantic similarity between words when describing a document in the resulting feature space. Recent adva...
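The limitation noted above is easy to see in a toy example: under a binary bag-of-words model, two documents that use different but related words share no active dimensions, so their cosine similarity is exactly zero. The vocabulary and documents below are hypothetical.

```python
import numpy as np

vocab = ["car", "automobile", "fast", "quick"]
doc1 = np.array([1, 0, 1, 0])   # "car fast"
doc2 = np.array([0, 1, 0, 1])   # "automobile quick"

cosine = doc1 @ doc2 / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
print(cosine)  # 0.0: related in meaning, but orthogonal feature vectors
```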




Journal title:
  • CoRR

Volume: abs/1603.07603

Pages: -

Publication date: 2016